ABSTRACT

AnnotQTL is a web tool designed to aggregate func- 
tional annotations from different prominent web 
sites by minimizing the redundancy of information. 
Although thousands of QTL regions have been 
identified in livestock species, most of them are 
large and contain many genes. This tool was there- 
fore designed to assist the characterization of 
genes in a QTL interval region as a step towards 
selecting the best candidate genes. It localizes the 
gene to a specific region (using NCBI and Ensembl 
data) and adds the functional annotations available 
from other databases (Gene Ontology, Mammalian 
Phenotype, HGNC and Pubmed). Both the human and
mouse genomes can be aligned with the studied region
to detect synteny and segment conservation, which is
useful for running inter-species comparisons of QTL
locations. Finally, custom
marker lists can be included in the results display 
to select the genes that are closest to your most 
significant markers. We use examples to demon- 
strate that in just a couple of hours, AnnotQTL is 
able to identify all the genes located in regions 
identified by a full genome scan, with some high- 
lighted based on both location and function, thus 
considerably increasing the chances of finding good 
candidate genes. AnnotQTL is available at
http://annotqtl.genouest.org.

INTRODUCTION 

The final steps of genetic mapping research programs 
require close analysis of several QTL regions to select
candidate genes for further studies. Despite several websites
(NCBI genome browser, Ensembl Browser, UCSC
Genome Browser) or web tools (Biomart, Galaxy)
developed to achieve this task, the selection of candidate
genes remains a laborious process. The information 
made available on the more prominent web sites differs 
slightly in terms of gene prediction and functional anno- 
tation, while other websites provide extra information that 
researchers may want to use (HGNC approved gene sym- 
bols, Gene Ontology (GO) Annotation or functional data, 
conservation of synteny with other species, etc.). It is pos- 
sible to manually merge and compare this information for 
one QTL containing few genes, but not for many different 
QTL regions containing dozens of genes. 

Here, we propose a web tool that, for a given region of 
interest, merges the list of genes available in NCBI and 
Ensembl, removes redundancy, adds functional annota- 
tions from different prominent web sites, and highlights 
the genes for which functional annotation fits the biologic- 
al function or diseases of interest. The tool is dedicated to 
sequenced species of livestock including cattle, pig, 
chicken and horse as well as dog, i.e. species that have 
been extensively studied (with over 8000 QTLs detected; 
see http://www.animalgenome.org/cgi-bin/QTLdb/index). 
Nevertheless, because of the family designs and the low 
number of animals used in these species, most of the 
studies use linkage analysis, and the QTL regions ident- 
ified remain large (containing dozens of genes). 
Conversely, in human and model species, most analyses 
now draw heavily on association studies involving large 
cohorts, thus providing more power and accuracy, and the 
web tools already available focus on these species through 
functional annotation of SNPs in association with the trait 
(1-8). Most of these tools focus on the SNP annotation 
itself, describing whether the SNP is located in a gene, or 
even in a coding sequence, and defining if it could have a 
functional effect. While these web tools are highly efficient 
in providing a good annotation for specific SNPs, they 
clearly cannot be used to collect information on the large
regions obtained in livestock species. 



METHODS 

The main objective of AnnotQTL is to minimize redun- 
dancy so as to display the maximum amount of informa- 
tion from several sources on the genes in the region of 
interest. The main AnnotQTL program is implemented 
in PERL. The data are downloaded from several FTP sites
or websites (see Figure 1 for details on the data and
fields used) and stored on our server for further
computation. Location and annotation data from Ensembl are
downloaded via BioMart (9) using MartService. 
AnnotQTL gives several sets of information from com- 
parative mapping of selected species against the human 
and mouse genomes using a local dump of the data 
provided by the Narcisse web site (10) and orthologous 
gene information from the Ensembl comparative database. 
PERL scripts import the downloaded files into our SQL 
databases. All PERL scripts and official GO database
updates are inserted into a BioMaj (11) workflow to automate
the updating process. Updates are performed monthly. 



map_seq
    tax_id VARCHAR(20), chr VARCHAR(30), chr_start INT(11), chr_end INT(11),
    chr_orient VARCHAR(3), GeneID VARCHAR(20), feature_type VARCHAR(20)

gene2ensembl
    tax_id VARCHAR(20), GeneID VARCHAR(20), ENS_G VARCHAR(20),
    RNA_accession VARCHAR(20), ENS_T VARCHAR(20), PR_accession VARCHAR(20),
    ENS_P VARCHAR(20)

gene2accession
    tax_id VARCHAR(20), GeneID VARCHAR(20), RNA_accession VARCHAR(200),
    PR_accession VARCHAR(200)

gene2go
    tax_id VARCHAR(20), GeneID VARCHAR(20), GO_ID VARCHAR(20),
    Evidence VARCHAR(5), Qualifier VARCHAR(10), GO_term VARCHAR(120),
    Pubmed VARCHAR(120), GO_class VARCHAR(2)

gene_info
    tax_id VARCHAR(20), GeneID VARCHAR(20), Symbol VARCHAR(20),
    Synonyms VARCHAR(80), dbXrefs LONGTEXT, chr VARCHAR(30),
    description LONGTEXT, type_of_gene VARCHAR(60), Symbol_nomen VARCHAR(20),
    Full_name_nomen LONGTEXT, Nomen_status VARCHAR(5),
    Other_designations LONGTEXT

gene2pubmed
    tax_id VARCHAR(20), GeneID VARCHAR(20), Pubmed_ID VARCHAR(20)

mp2mp
    parent_id VARCHAR(20), child_id VARCHAR(20)

mp2term
    id VARCHAR(20), name VARCHAR(1000), definition VARCHAR(5000),
    synonym VARCHAR(5000)

biomart_gene2annot
    tax_id VARCHAR(20), ensID VARCHAR(20), Gene_name VARCHAR(20),
    description LONGTEXT, chr VARCHAR(30), chr_start INT(11), chr_end INT(11),
    chr_orient VARCHAR(3)

biomart_gene2ortholog
    tax_id VARCHAR(20), ensID VARCHAR(20), HSA_ensID VARCHAR(20),
    HSA_orthotype VARCHAR(60), MMU_ensID VARCHAR(20), MMU_orthotype VARCHAR(60)

mp2assoc
    hgnc_symbol VARCHAR(20), GeneID VARCHAR(20), MMU_symbol VARCHAR(20),
    MGI_id VARCHAR(20), MP_id VARCHAR(1000)

hgnc
    hgnc_id INT(11), appr_symbol VARCHAR(20), appr_name VARCHAR(1000),
    status VARCHAR(40), locus_type VARCHAR(100), prev_symbol VARCHAR(1000),
    prev_name LONGTEXT, aliases VARCHAR(1000), name_aliases LONGTEXT,
    acc_number VARCHAR(1000), enz_id VARCHAR(60), entrez_id VARCHAR(30),
    ensembl_id VARCHAR(20), pubmed_id VARCHAR(500), refseq_id VARCHAR(20),
    omim_id INT(11)

narcisse_synt
    taxid_query VARCHAR(20), taxid_target VARCHAR(20), synt_order TINYINT(4),
    element_id INT(11), chr_query VARCHAR(30), start_query INT(11),
    end_query INT(11), sign VARCHAR(3), chr_target VARCHAR(30),
    start_target INT(11), end_target INT(11)

snp
    rs VARCHAR(20), chr VARCHAR(30), chr_pos INT(11), local_loci VARCHAR(30)

omim_genemap
    Symbol VARCHAR(200), title LONGTEXT, MIM_number INT(11),
    Disorders LONGTEXT

Figure 1. Schematic diagram of the database and source data files. The table map_seq is filled using file xxx_seq_gene.md.gz, where xxx is the species 
name, located in the NCBI FTP directory: /genomes/xxx/mapview. The tables gene_info, gene2accession, gene2go, gene2ensembl and gene2pubmed 
are filled using data files stored in the NCBI FTP directory: /gene/DATA. The table biomart_xxx is filled using the BioMart service for the Ensembl 
databases. For each species, a SQL table is created to store SNP data (here, only one is detailed). The data files are downloaded from NCBI FTP 
directory: /snp/organisms/xxx/chr_rpts (where xxx is species name). The tables mp2mp, mp2term and mp2assoc are filled using files 
HMD_HumanPhenotype.rpt and MPheno_OBO.ontology available from the MGI FTP site (ftp.informatics.jax.org/pub/reports). The hgnc table 
is filled using the data stored at the GeneName website together with its LWP agent (http://www.genenames.org/cgi-bin/hgnc_downloads.cgi). The 
table omim_genemap is filled using the data file located in the NCBI FTP directory /repository/OMIM. The table narcisse_synt is filled using the 
comparative data provided by the Narcisse website (http://narcisse.toulouse.inra.fr). 

The principle and workflow of AnnotQTL are depicted
in Figure 2. Starting with the genome coordinates
entered by the user, the program extracts the NCBI
GeneID of genes contained in the region, the corresponding
annotations (name, description, symbol and so
on), plus the associated cross-references (RNA accession
number, protein accession number and Ensembl identifier)
and the Pubmed identifier. This Pubmed identifier
is specific to the requested species and does not list
the publications related to this gene in other species.
Using the same genome coordinates, the program then 
extracts Ensembl ID, gene annotation and human and 
mouse ortholog gene identifiers from the Ensembl 
database. 
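
The region lookup described above can be sketched as a simple interval query. The following is a minimal illustration only, not the AnnotQTL implementation (which is written in PERL). It assumes an in-memory SQLite copy of the map_seq table from Figure 1, treats a gene as belonging to the region as soon as it overlaps it, and assumes that gene rows carry the feature type 'GENE'. The function name genes_in_region and the toy GeneIDs are hypothetical.

```python
import sqlite3

# Toy copy of the map_seq table (column names as read from Figure 1).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE map_seq (tax_id TEXT, chr TEXT, chr_start INTEGER, "
    "chr_end INTEGER, chr_orient TEXT, GeneID TEXT, feature_type TEXT)"
)
# Hypothetical rows: the GeneIDs and coordinates are made up for the sketch.
conn.executemany(
    "INSERT INTO map_seq VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("9913", "2", 100_000, 200_000, "+", "1001", "GENE"),
        ("9913", "2", 7_900_000, 8_100_000, "-", "1002", "GENE"),  # spans the boundary
        ("9913", "2", 9_000_000, 9_100_000, "+", "1003", "GENE"),  # outside the region
        ("9913", "3", 100, 200, "+", "1004", "GENE"),              # other chromosome
    ],
)


def genes_in_region(conn, tax_id, chrom, start, end):
    """Return GeneIDs overlapping [start, end] on chrom, ordered by position.

    Assumption: overlap (not strict containment) defines membership.
    """
    rows = conn.execute(
        "SELECT GeneID, MIN(chr_start) AS first_start FROM map_seq "
        "WHERE tax_id = ? AND chr = ? AND feature_type = 'GENE' "
        "AND chr_start <= ? AND chr_end >= ? "
        "GROUP BY GeneID ORDER BY first_start",
        (tax_id, chrom, end, start),
    ).fetchall()
    return [gene_id for gene_id, _ in rows]


print(genes_in_region(conn, "9913", "2", 0, 8_000_000))
```

The same coordinates could then be reused against the biomart_gene2annot table to collect the Ensembl identifiers for the region.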

The main step now is to remove the redundancy between 
NCBI and Ensembl data while keeping the specific anno- 
tation of both databases. As there is a slight difference in 
gene location between the two web sites, the filtering 
process cannot be based on gene location, which leaves 
two approach options. The first is to use the Ensembl 
cross-reference provided by NCBI. However, this approach
is not exhaustive since a few cross-references are
missing even for genes annotated in both databases. A
second strategy has therefore been developed based on a 
textual query search in the annotation fields provided by 
the two sites. Values for the symbol, synonyms, RNA ac- 
cession and protein accession fields from NCBI are 
compared against the values in the Ensembl gene name 
field for the gene of the species of interest. When one 
or more of these fields match, all the information is 
combined under one record, thus removing duplicates 
and enriching the annotation (without losing the annota- 
tion specific to both sites). If available, the gene annota- 
tions of human and mouse orthologs are also included 
in this comparison. Each record is also filtered for potential
intra-redundancy of annotation between the gene and
its orthologs (i.e. the same gene description is found
for the requested species and its human or mouse
orthologs). This set of genes combining NCBI- and
Ensembl-specific information is then compared to the 
HGNC database. The goal of this procedure is to retrieve 
the HGNC approved symbol by searching for corres- 
pondences between annotation fields and the symbol or 
aliases fields of the HGNC database. Then, the values 
found in the HGNC database (symbol and OMIM iden- 
tifier, if any) are included in the final results output 
displayed. If the OMIM identifier is still undefined, a 
search through the OMIM symbol fields is performed 
using the HGNC symbol or aliases. Where relevant, 
OMIM identifier, title and related disorders are retrieved 
from the OMIM database. Finally, the user-specified 
genome region is cross-compared against the Narcisse 
database to fetch the human or mouse-orthologous 
genomic region. 
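
The redundancy-removal step can be illustrated with a short sketch. This is a simplified reading of the procedure, not the published code: it tries the NCBI-provided Ensembl cross-reference first, then falls back to a case-insensitive text match of the NCBI symbol, synonyms, RNA accession and protein accession against the Ensembl gene name. The dictionary keys, the function names and the toy identifiers (G1, ENS_A and so on) are hypothetical. The sketch assumes that synonyms are separated by '|' and that '-' marks an empty field, as in the NCBI gene_info file.

```python
def match_keys(ncbi):
    """Collect the upper-cased NCBI values that may match an Ensembl gene name."""
    keys = {ncbi.get("symbol", ""), ncbi.get("rna_acc", ""), ncbi.get("prot_acc", "")}
    keys.update(ncbi.get("synonyms", "").split("|"))
    return {k.strip().upper() for k in keys if k and k.strip() not in ("", "-")}


def merge_gene_lists(ncbi_genes, ensembl_genes):
    """Merge two gene lists into one non-redundant list of records."""
    merged, used = [], set()
    for n in ncbi_genes:
        keys = match_keys(n)
        hit = None
        for e in ensembl_genes:
            if e["ens_id"] in used:
                continue
            # First option: cross-reference; second option: text match.
            if n.get("ens_xref") == e["ens_id"] or e.get("name", "").upper() in keys:
                hit = e
                break
        if hit is not None:
            used.add(hit["ens_id"])
        merged.append({
            "gene_id": n["gene_id"],
            "ens_id": hit["ens_id"] if hit else None,
            "symbol": n["symbol"],
            "source": "NCBI+Ensembl" if hit else "NCBI",
        })
    # Keep the genes known to Ensembl only.
    for e in ensembl_genes:
        if e["ens_id"] not in used:
            merged.append({"gene_id": None, "ens_id": e["ens_id"],
                           "symbol": e.get("name", ""), "source": "Ensembl"})
    return merged


# Toy input: identifiers are invented for the example.
ncbi = [
    {"gene_id": "G1", "symbol": "MSTN", "synonyms": "GDF8"},
    {"gene_id": "G2", "symbol": "ABC1", "synonyms": "-", "ens_xref": "ENS_B"},
    {"gene_id": "G3", "symbol": "LOC999", "synonyms": "-"},
]
ensembl = [
    {"ens_id": "ENS_A", "name": "GDF8"},
    {"ens_id": "ENS_B", "name": ""},
    {"ens_id": "ENS_C", "name": "XYZ2"},
]
result = merge_gene_lists(ncbi, ensembl)
print([r["source"] for r in result])
```

In this toy case three NCBI genes and three Ensembl genes collapse into four records, which mirrors on a small scale how the merged set can be larger than either source list yet smaller than their sum.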

To clarify the output and adapt it to the scientist's 
query, certain information is only available through 
menu options. Human and mouse orthology information 
from Ensembl can be used to more accurately define 
certain genomic regions left undefined in Narcisse data. 
Users can also select the level of synteny (synt_order, see
(10) for more details) between the studied species and the target
species (human or mouse). Another option is to upload a
set of genetic markers (which can be of any type provided 
physical location is given) to be inserted in the final results 
display. Users can choose to keep their own marker
locations or re-map markers to NCBI genome coordinates
(only available with approved marker identifiers). A 
fourth non-processed column is available for displaying 
user-defined information. Adding the markers to the re- 
sults display should ease the identification of the genes 
that most closely match the most significant markers. 
Finally, AnnotQTL can highlight genes based on func- 
tional annotations provided by GO, Mammalian 
Phenotype (MP), or OMIM disorders. For GO or MP 
terms organized in a hierarchical 'parent-children' direc- 
tory structure, user-inputted keywords provide options 
for selecting the corresponding terms and associated 
children. For OMIM, a query is performed against 
OMIM disorders data retrieved in the previous step with 
user-input keywords: if the keywords match, then the
OMIM disorders are highlighted in the display. For GO,
the genes are highlighted if their GeneID matches
the GO association provided by NCBI. As genes specific
to the Ensembl database do not have a GeneID, their
match-up with GO annotation is based on
their HGNC name, where available. Users can improve
this 'GOA highlight function' by adding the GOA of
human and/or mouse orthologs to the current
genes. For MP, genes are highlighted if their approved 
symbols match the HGNC approved symbols stored in 
the MP database. The aim here is to provide functional 
information and facilitate the identification of genes 
linked to the trait-of-interest (i.e. functional candidate 
genes). 
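
The keyword-based highlighting over a 'parent-children' hierarchy can be sketched as follows. This is an illustrative approximation under stated assumptions: a keyword selects every term whose name contains it, the selection is extended to all descendants through parent/child pairs shaped like the mp2mp table, and a gene is highlighted when it carries at least one selected term. The term identifiers, gene names and function names are hypothetical.

```python
from collections import defaultdict, deque


def terms_for_keyword(keyword, term_names, parent_child):
    """Return the terms whose name contains keyword, plus all their descendants."""
    children = defaultdict(list)
    for parent, child in parent_child:
        children[parent].append(child)
    seeds = [t for t, name in term_names.items() if keyword.lower() in name.lower()]
    selected, queue = set(seeds), deque(seeds)
    while queue:  # breadth-first walk down the hierarchy
        term = queue.popleft()
        for child in children[term]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def highlight(gene_terms, selected):
    """Return the genes annotated with at least one selected term."""
    return sorted(g for g, terms in gene_terms.items() if selected & set(terms))


# Toy ontology and annotations, invented for the example.
term_names = {
    "T:1": "adipose tissue phenotype",
    "T:2": "abnormal fat pad",
    "T:3": "abnormal brown fat",
    "T:4": "abnormal heart",
}
parent_child = [("T:1", "T:2"), ("T:2", "T:3")]
gene_terms = {"GENE_A": ["T:3"], "GENE_B": ["T:4"], "GENE_C": ["T:1", "T:4"]}

selected = terms_for_keyword("adipose", term_names, parent_child)
print(highlight(gene_terms, selected))
```

Here GENE_A is highlighted although none of its term names contains the keyword, because its term descends from a matching parent term.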



APPLICATION 

To demonstrate the utility of AnnotQTL and test the ef- 
ficiency of this web tool, we present different examples 
using real data aimed at identifying functional and pos- 
itional candidate genes. 

The first example focuses on a bovine mutation 
controlling muscular hypertrophy. In 1995, the mutation 
was mapped to the extremity of the BTA2 in a 12 cM 
interval (12). Using AnnotQTL on this region of the 
bovine chromosome (0-8 Mb) retrieved the location and 
functional annotation of 95 genes. We then applied the 
'GO highlight function' on this region in two separate 
queries, using 'muscle' and 'growth' as keywords best 
describing the observed phenotype. These two terms high- 
lighted two and three genes, respectively, from these 95 
genes. Both lists highlighted the MSTN (GDF8) gene,
which has been validated as the causal
gene (myostatin) (13).

A second step analyzed a more extensive set of 21 QTL 
regions shaping abdominal fatness in chickens (14,15). 
The average length of these regions was 4.8 Mb. After
running AnnotQTL, all the regions were enriched with 
genes by comparing NCBI and Ensembl information 
against information provided by either NCBI or 
Ensembl only (Table 1). For all the genomic regions, 
working from an initial set of 1734 genes from the NCBI
database and 1902 genes from the Ensembl database,
AnnotQTL retrieved a non-redundant set of 2220 genes.
On this large dataset, we applied the 'highlight function'
on each region to underline genes whose functional
annotation was related to the studied phenotype. Among the
2220 genes located in these 21 QTL regions, 127 were
highlighted using the GO term 'lipid' and the MP term
'adipose' as keywords, with an average of 5.4 genes
highlighted per region.

Figure 2. AnnotQTL: principle and workflow. Boxes shaded in gray represent user input or database (i.e. Narcisse) input. Boxes shaded in yellow
show the main processes in the AnnotQTL workflow. Boxes shaded in orange represent intermediate results. MP: Mammalian Phenotype.

Table 1. Statistics of the QTL/eQTL regions analyzed using AnnotQTL

                                                                            GO and MP terms screening results
        Number of  Regions mean  NCBI   Ensembl  AnnotQTL genes obtained    Genes   Average of genes
        regions    size (Mb)     genes  genes    merging NCBI and Ensembl   found   found per region

QTL     21         4.8           1734   1902     2220                       127     5.8
eQTL    25         3.4           1198   1283     1506                       93      3.7

Finally, AnnotQTL can also be exploited to look at
eQTL regions. Strategies combining transcriptomics and
genotyping data have recently been developed to better
characterize QTL regions for traits of interest by
identifying co-localized eQTLs and QTLs (16-21). Whatever the
context, this strategy identifies a much higher number of
regions than QTL studies, thus creating a need
for tools that can efficiently find positional and functional
candidate genes. Here, we focus on 25 chicken eQTL
regions affecting 70 genes involved in lipid metabolism
(i.e. sharing the GO term GO:0006629 'lipid metabolic
process'). The average length of these regions is 3.4 Mb.
Running AnnotQTL gave results similar to those
obtained for the QTL regions. All the regions were
enriched with genes by comparing NCBI and Ensembl
information against information provided by either
NCBI or Ensembl only (Table 1): working from an
initial set of 1198 genes from the NCBI database and
1283 genes from the Ensembl database, AnnotQTL
retrieved a non-redundant set of 1506 genes. Again, in
order to select possible candidate genes, we used the
'highlight function' to pinpoint the genes related to the studied
phenotype. Among these 1506 genes, and using the same
GO term 'lipid' and MP term 'adipose' as keywords, a
total of 93 genes were identified, with an average of 3.7
genes highlighted per region.

These examples, corresponding to two different contexts
(QTL and eQTL analyses), clearly demonstrate how in just
a couple of hours, AnnotQTL can accurately analyze the
gene content of numerous regions identified by a full
genome scan and go on to highlight some of these genes
based on both their location and function, whereas in the
same time period, a manually run procedure would only
have been able to analyze a single region.

CONCLUSION

AnnotQTL is a web tool designed to gather the functional
annotation of different prominent web sites while
minimizing redundant information. Using all known
information substantially accelerates the gene analysis of
QTL regions for livestock species traits and improves the
selection of candidate genes.